PSC 103B
We can access this dataset by installing the palmerpenguins package.
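A minimal sketch of how the overview shown next could be produced. It assumes the CRAN package name palmerpenguins and uses glimpse() from dplyr, which is our assumption; base R's str() gives similar information.

```r
# Install once if needed (assumes the CRAN package name "palmerpenguins"):
# install.packages("palmerpenguins")

library(palmerpenguins)  # provides the penguins data frame
library(dplyr)           # assumed here only for glimpse()

# Compact overview: number of rows and columns, column types, first values
glimpse(penguins)
```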
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Outcome variable: bill_length_mm
Not all penguins gave data on bill length and there are some missing values.
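The complete.cases() function returns a logical vector with one entry per row: TRUE where there are no missing values on the variable(s) you give it, and FALSE otherwise. We can use that vector to keep only the complete rows.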
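As a sketch, this is one way to drop the penguins with missing bill length. The name penguins_subset matches the object used later in this lab, but building it this way is our assumption.

```r
library(palmerpenguins)

# TRUE for rows where bill_length_mm is observed, FALSE where it is NA
keep <- complete.cases(penguins$bill_length_mm)
head(keep)

# How many penguins are missing bill length?
sum(!keep)

# Keep only the rows with an observed bill length
penguins_subset <- penguins[keep, ]
nrow(penguins_subset)
```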
Suppose we were interested in whether male penguins or female penguins had different bill lengths.
We suspected that male penguins have longer bill lengths than female penguins.
Let’s look at both means
Another way to do this is to use the tapply() function.
tapply(variable, group, function, extra arguments for the function)
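Both approaches can be sketched as follows, assuming penguins_subset holds the penguins with a non-missing bill length. The object names female_bills and male_bills are only illustrative.

```r
library(palmerpenguins)
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]

# Approach 1: pull out each group's values and take the means separately.
# which() drops the penguins whose sex is NA.
female_bills <- penguins_subset$bill_length_mm[which(penguins_subset$sex == "female")]
male_bills   <- penguins_subset$bill_length_mm[which(penguins_subset$sex == "male")]
mean(female_bills)
mean(male_bills)

# Approach 2: tapply(variable, group, function, extra arguments)
sex_means <- tapply(penguins_subset$bill_length_mm,
                    penguins_subset$sex, mean, na.rm = TRUE)
sex_means
```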
Is the numerical difference of ~4 mm actually significant?
\(H_0: \mu_{female} = \mu_{male}\), or the average bill length of females is the same as the average bill length of males.
\(H_1: \mu_{female} < \mu_{male}\), or the average bill length of females is less than that of males.
The t-test is trying to see whether the difference you observed between the groups is large given the expected variability of that difference across samples.
Our hypothesis was that females have shorter bill lengths than males.
R views the females as Group 1 and males as Group 2 (because female is alphabetically before male). We need to decide our alternative with Group 1 compared to Group 2.
Tip
The argument alternative specifies the alternative hypothesis and can take any of these three values: "two.sided", "less", or "greater". Think about our hypothesis to choose one of the alternatives.
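A sketch of the call, assuming penguins_subset holds the penguins with a non-missing bill length. Because female is Group 1 and we expect females to have shorter bills, the alternative is "less".

```r
library(palmerpenguins)
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]

# Check the group order R will use (first level = Group 1)
levels(penguins_subset$sex)

# One-sided Welch t-test: is the female mean less than the male mean?
bill_test <- t.test(bill_length_mm ~ sex,
                    data = penguins_subset,
                    alternative = "less")
bill_test
```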
Welch Two Sample t-test
data: bill_length_mm by sex
t = -6.6725, df = 329.29, p-value = 5.332e-11
alternative hypothesis: true difference in means between group female and group male is less than 0
95 percent confidence interval:
-Inf -2.82883
sample estimates:
mean in group female   mean in group male
            42.09697             45.85476
The Welch Two Sample t-test found that female penguins (M = 42.1, SD = 4.90) have, on average, shorter bill lengths than male penguins (M = 45.9, SD = 5.37), t(329.29) = -6.67, p < .001.
Notice that R runs Welch’s t-test by default.
Welch’s t-test does not assume that the two groups have equal variances, and it also works when the group sizes differ. Not assuming equal variances is usually the safer choice.
If you want to assume equal variances, set the argument var.equal = TRUE.
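For comparison, a sketch of the equal-variance (Student's) version of the same test, again assuming penguins_subset holds the penguins with a non-missing bill length. The degrees of freedom are then the total sample size minus 2 rather than the Welch approximation.

```r
library(palmerpenguins)
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]

# Same one-sided test, but pooling the variance of the two groups
student_test <- t.test(bill_length_mm ~ sex,
                       data = penguins_subset,
                       alternative = "less",
                       var.equal = TRUE)
student_test

# Degrees of freedom: n_female + n_male - 2
student_test$parameter
```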
What should we do if we have more than two groups we are interested in comparing?
Our question is the same as a t-test - are there differences in the average score across the groups - but we can’t use a t-test, because a t-test is limited to 2 groups.
Running multiple t-tests increases our Type 1 error rate - the probability of finding a significant difference when there is none.
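To see how quickly this adds up, here is a small worked calculation. It assumes each test uses alpha = .05 and that the tests are independent, which is a simplification since pairwise comparisons share data.

```r
alpha    <- 0.05
n_groups <- 3

# Number of pairwise t-tests needed to compare every pair of groups
n_tests <- choose(n_groups, 2)
n_tests

# Probability of at least one false positive across all tests
familywise_error <- 1 - (1 - alpha)^n_tests
familywise_error  # 0.142625
```

With three groups, the chance of at least one false positive is already about 14% instead of 5%.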
One-way ANOVA lets us examine whether multiple groups differ in their average scores.
Let us apply this to the example of whether bill length differs across the different species of penguins.
\(H_0: \mu_{Adelie} = \mu_{Chinstrap} = \mu_{Gentoo}\) or in other words, the average bill length is the same for all 3 species of penguins. The alternative hypothesis is:
\(H_A\): At least one of the means is different, or \(H_0\) is not true.
At face value, the means of the groups are different from each other, but there is also a lot of variability within each group around that group mean.
ANOVA quantifies how much of the variation that we see between groups is due to actual, significant group differences and how much of it is just due to sampling variation.
If \(H_0\) were true, we would expect the variance between groups to be about the same size as the variance due to individual differences, because both would reflect only sampling variation.
If \(H_0\) were not true, and there were actual group differences, then we expect the variation between groups to be larger than the residual variance (which is the variance due to sampling error/non-group differences).
Let’s walk through an example
We can compare between and within groups variability with the F-ratio:
\[ \begin{aligned} F &= \frac{\text{Between groups variability}}{\text{Within groups variability}}\\ &= \frac{\text{Group effects + Ind diffs + Error}}{\text{Ind diffs + Error}} \end{aligned} \]
If the group effect is zero, the numerator and denominator estimate the same thing, so the F-ratio will be close to one.
\[ F = \frac{MS_{Between}}{MS_{Within}} = \frac{\frac{SS_{Between}}{df_{Between}}}{\frac{SS_{Within}}{df_{Within}}} \]
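As a sketch, R's built-in aov() function reports these pieces, so we can rebuild the F-ratio from the sums of squares and degrees of freedom and use it as a check on the hand calculations. This assumes penguins_subset holds the penguins with a non-missing bill length.

```r
library(palmerpenguins)
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]

# One-way ANOVA of bill length on species
fit <- aov(bill_length_mm ~ species, data = penguins_subset)
anova_table <- summary(fit)[[1]]  # row 1 = between (species), row 2 = within (residuals)
anova_table

# Rebuild F = (SS_between / df_between) / (SS_within / df_within)
ss <- anova_table[["Sum Sq"]]
df <- anova_table[["Df"]]
f_by_hand <- (ss[1] / df[1]) / (ss[2] / df[2])
f_by_hand
```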
If \(H_0\) were true, then our best guess for the score of a new penguin would be the grand mean (or the mean of the entire sample), since group membership wouldn’t tell us anything useful.
We can compare each group’s mean to this grand mean.
If the group means are all similar, then the variance will be small.
If the group means are different, then the variance will be large.
First, let us calculate the mean of each group.
How would you do it in R?
We can make a dataframe that contains the group means and the grand mean, to make it easier to calculate the \(SS_{Between}\).
penguin_means <- data.frame(
  GroupMean = tapply(penguins_subset$bill_length_mm,
                     penguins_subset$species, mean,
                     na.rm = TRUE),
  GrandMean = mean(penguins_subset$bill_length_mm,
                   na.rm = TRUE))
penguin_means
          GroupMean GrandMean
Adelie     38.79139  43.92193
Chinstrap  48.83382  43.92193
Gentoo     47.50488  43.92193
Exercise
Create a new column in penguin_means named mean_deviations containing the difference between each group mean and the grand mean.
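One possible solution, sketched under the assumption that penguins_subset holds the penguins with a non-missing bill length. The block rebuilds penguin_means so that it runs on its own.

```r
library(palmerpenguins)
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]

# Group means and grand mean, as in the dataframe built earlier
penguin_means <- data.frame(
  GroupMean = tapply(penguins_subset$bill_length_mm,
                     penguins_subset$species, mean, na.rm = TRUE),
  GrandMean = mean(penguins_subset$bill_length_mm, na.rm = TRUE))

# New column: how far each group mean is from the grand mean
penguin_means$mean_deviations <- penguin_means$GroupMean - penguin_means$GrandMean
penguin_means
```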
The mean deviations do not tell us much yet. We are first trying to estimate the \(SS_{Between}\).
If you recall from class, this tells us the variability of group means around the grand mean, scaled by group sample size:
\[ SS_{Between} = \sum^k_{j=1}n_j(\bar{X_j}-\bar{X})^2 \]
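The formula can be sketched in R as follows, using table() to get the group sizes \(n_j\). The object names group_n, group_means, and grand_mean are illustrative, and penguins_subset is assumed to hold the penguins with a non-missing bill length.

```r
library(palmerpenguins)
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]

# n_j: number of penguins in each species
group_n <- table(penguins_subset$species)
group_n

# Group means and grand mean (both in the same species order as group_n)
group_means <- tapply(penguins_subset$bill_length_mm,
                      penguins_subset$species, mean, na.rm = TRUE)
grand_mean  <- mean(penguins_subset$bill_length_mm, na.rm = TRUE)

# SS_between = sum over groups of n_j * (group mean - grand mean)^2
ss_between <- sum(as.vector(group_n) * (as.vector(group_means) - grand_mean)^2)
ss_between

# Degrees of freedom: number of groups minus 1
df_between <- length(group_n) - 1
df_between
```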
The group sample sizes \(n_j\) can be obtained with the table() function.
Now we need to calculate \(SS_{Within}\), or the residual variation, which is based on the difference between each individual’s score and their group’s mean.
To calculate this, we need to get each penguin’s observation and each penguin’s group mean in the same dataframe.
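One way to sketch this step is with base R's ave() function, which returns the group mean repeated for every row of that group. The column names group_mean and residual are our own choice, not something defined in the lab.

```r
library(palmerpenguins)
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]

# ave() applies mean() within each species and lines the result up row by row
penguins_subset$group_mean <- ave(penguins_subset$bill_length_mm,
                                  penguins_subset$species)

# Each penguin's deviation from its own group mean
penguins_subset$residual <- penguins_subset$bill_length_mm - penguins_subset$group_mean

head(penguins_subset[, c("species", "bill_length_mm", "group_mean", "residual")])
```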
Exercise
Calculate the sum of squared deviations from the group mean separately for each group, and save them in three different objects: penguins_adelie_dev, penguins_chinstrap_dev, and penguins_gentoo_dev.
penguins_adelie_dev <- sum((penguins_adelie -
  mean(penguins_adelie, na.rm = TRUE))^2, na.rm = TRUE)
penguins_chinstrap_dev <- sum((penguins_chinstrap -
  mean(penguins_chinstrap, na.rm = TRUE))^2, na.rm = TRUE)
penguins_gentoo_dev <- sum((penguins_gentoo -
  mean(penguins_gentoo, na.rm = TRUE))^2, na.rm = TRUE)
And now we add up all these deviations to get SSW.
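Adding the three pieces could look like the sketch below. It assumes that penguins_adelie, penguins_chinstrap, and penguins_gentoo are vectors of bill lengths for each species, built here from penguins_subset. The helper sum_sq_dev() is hypothetical and only shortens the repeated calculation.

```r
library(palmerpenguins)
penguins_subset <- penguins[complete.cases(penguins$bill_length_mm), ]

# Assumed definitions: one vector of bill lengths per species
penguins_adelie    <- penguins_subset$bill_length_mm[penguins_subset$species == "Adelie"]
penguins_chinstrap <- penguins_subset$bill_length_mm[penguins_subset$species == "Chinstrap"]
penguins_gentoo    <- penguins_subset$bill_length_mm[penguins_subset$species == "Gentoo"]

# Hypothetical helper: sum of squared deviations from the vector's own mean
sum_sq_dev <- function(x) sum((x - mean(x, na.rm = TRUE))^2, na.rm = TRUE)

penguins_adelie_dev    <- sum_sq_dev(penguins_adelie)
penguins_chinstrap_dev <- sum_sq_dev(penguins_chinstrap)
penguins_gentoo_dev    <- sum_sq_dev(penguins_gentoo)

# SS_within: add up the three within-group sums of squares
ss_within <- penguins_adelie_dev + penguins_chinstrap_dev + penguins_gentoo_dev
ss_within

# Degrees of freedom: total sample size minus number of groups
df_within <- nrow(penguins_subset) - 3
ms_within <- ss_within / df_within
ms_within
```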
PSC 103B - Statistical Analysis of Psychological Data